Skip to content

Reject unresolvable unit symbols instead of dropping them#30

Merged
nertzy merged 1 commit into
minad:masterfrom
nertzy:strict-lexer
Jun 12, 2026
Merged

Reject unresolvable unit symbols instead of dropping them#30
nertzy merged 1 commit into
minad:masterfrom
nertzy:strict-lexer

Conversation

@nertzy

@nertzy nertzy commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • The lexer built its symbol character class only from the glyphs of registered symbols, so any other non-ASCII glyph matched no token and String#scan silently skipped it. Unknown ASCII words still matched the baseline [A-Za-z0-9_]+ class and reached validation, but unknown glyphs vanished:
    • Unit(1, 'λ') returned a unitless 1
    • Unit(1, 'm λ') silently became 1 m
    • Unit(1, 'm/λ') raised a misleading SyntaxError: Unexpected token /
  • Now the lexer tokenizes any run of non-operator, non-whitespace characters as a symbol, so every unrecognized glyph survives lexing and fails loudly as Undefined unit <symbol> through the same validation path unknown ASCII symbols already used — and the error names the real culprit (Undefined unit λ) rather than a neighboring operator.
  • As a side effect the tokenizer no longer depends on what is registered, so it is a plain constant again instead of a per-load rebuild. Registered glyphs (Ω, °C, µ, %, units loaded at runtime) still parse, and to_s round-trips are preserved.

Breaking: parsing an unrecognized unit symbol now always raises TypeError: Undefined unit … instead of sometimes silently yielding a dimensionless value. This extends the contract that unknown ASCII symbols already followed to unknown glyphs of any alphabet.

Test plan

  • bundle exec rake — both phases green (140 no-DSL + 142 DSL examples, 0 failures)
  • New specs cover: unknown ASCII symbol; unregistered glyph alone; glyph implicitly multiplied with a known unit; glyph adjacent to an operator; and a unit whose defining system has not been loaded yet (raises until loaded, then resolves)

The lexer built its symbol character-class only from registered glyphs, so
any other non-ASCII glyph matched no token and String#scan silently skipped
it. Unknown ASCII words still matched the baseline class and reached
validation, but unknown glyphs vanished: Unit(1, 'λ') returned a unitless 1,
Unit(1, 'm λ') silently became 1 m, and Unit(1, 'm/λ') raised a misleading
'Unexpected token /'.

Tokenize any run of non-operator, non-whitespace characters as a symbol so
every unrecognized glyph survives lexing and fails loudly as 'Undefined unit
<symbol>' via the same validation path unknown ASCII symbols already used,
naming the real culprit. This also makes the tokenizer independent of what is
registered, so it is a plain constant again rather than a per-load rebuild.

Add specs covering unknown ASCII symbols, unregistered glyphs (alone,
implicitly multiplied, and operator-adjacent), and a unit whose defining
system has not been loaded yet.
@nertzy nertzy merged commit 63e438e into minad:master Jun 12, 2026
4 checks passed
@nertzy nertzy deleted the strict-lexer branch June 12, 2026 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant